Pandas is a Python library designed specifically for data analysis. The author of Python for Data Analysis, Wes McKinney, began developing Pandas in 2008 while at AQR Capital Management, out of the need for a performant, flexible tool for quantitative analysis of financial data. Before leaving AQR he was able to convince management to let him open source the library. Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor. Right around that time, the library became popular in the Python community, and many more contributors joined the project, making it one of the most vital and active data analysis libraries for Python. (Wikipedia)
Pandas can be thought of as the Python equivalent of Microsoft Excel: it abstracts the notion of the spreadsheet, giving the user powerful and robust analytical tools and, more generally, the means to automate repeated processes.
The twin centerpieces of the Pandas library are the `Series` and the `DataFrame`. The `Series` class is, at its core, a one-dimensional NumPy array surrounded by additional information, such as its index. The `DataFrame` is conceptually an array of `Series` objects, each sharing the same index.
In [ ]:
from pandas import Series, DataFrame
import pandas as pd
We will be using Wes McKinney's GitHub notebook as a skeleton. He imports the following libraries for later use:
In [ ]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)
Consider the following input:
In [ ]:
obj = Series([4, 7, -5, 3])
obj
We have set the variable `obj` to reference a new Pandas `Series`, which we initialized by passing a Python list as input. Notice that Pandas automatically interprets the input data as type `int64`, which indicates that it is fairly smart! Also, notice that upon printing `obj` we see two columns. The first column is the index of the `Series`, which is presently the natural index, `range(4)`. The second column holds the input data we gave initially, which Pandas refers to as the values of `obj`. You can access these columns individually by calling `obj.index` and `obj.values`, respectively. For example:
In [ ]:
print obj.index, "\n\n"
print obj.values
In [ ]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
What happens when you examine the index now?
In [ ]:
obj2.index
Don't be alarmed by the `u` prefix on each of the index values. In Python (as well as in other languages), this simply indicates a Unicode string. For our purposes, there is effectively no difference between normal strings and Unicode strings.
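A quick sanity check (this is standard Python 2 behavior, not specific to Pandas):
In [ ]:
# In Python 2, a Unicode string compares equal to the plain string
# with the same ASCII contents.
u'a' == 'a'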
You can access a particular member of the `Series` data by specifying its index. For example,
In [ ]:
obj2['a']
This allows you to change the value of specific entries in your `Series`. Additionally, you can extract a sub-`Series` by specifying a sublist of the index.
In [ ]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]
A powerful tool in Pandas is the ability to concisely access data meeting Boolean qualifications. In the case below, `obj2 > 0` is given as the "index," and the output is the sub-`Series` of `obj2` containing only the positive entries.
In [ ]:
obj2[obj2 > 0]
In [ ]:
obj2 > 0
This is actually a neat property of Pandas inherited from NumPy. In NumPy, suppose you are given an array:
In [ ]:
arr = np.random.rand(5) * 2.0 - 1.0
The actual array itself is given by
In [ ]:
arr
The Boolean array specifying which elements of `arr` are positive is given by
In [ ]:
boolArr = arr > 0.0
boolArr
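To complete the analogy with the Pandas example above, the Boolean array can be used to slice the original array, keeping only the entries marked `True`:
In [ ]:
# Keeps only the positive elements of arr, just as obj2[obj2 > 0]
# kept the positive entries of the Series.
arr[boolArr]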
Arithmetic and NumPy's mathematical functions carry over in the same spirit. For example,
In [ ]:
obj2 * 2
returns a `Series` whose values are doubled, and
In [ ]:
np.exp(obj2)
returns a `Series` whose values have all been subject to the transformation $x\mapsto e^x$. Notice additionally that the `dtype` of the result has automatically been promoted from `int64` to `float64`. Again, Pandas is being smart!
In Python, there is a binary operator called `in`, which takes two "arguments." The left-hand argument can be any type of data (or object; we won't get into this), while the right-hand argument is some type of iterable object. Then `in` returns `True` if the left-hand argument is an element of the right-hand argument. Mathematically, this is equivalent to set membership. For example,
In [ ]:
odds = [i for i in range(20) if i%2 == 1]
print 3 in odds, "|", 2 in odds
is equivalent to noting that if $$\text{Odds} = \{ n : 0 \leq n < 20 \text{ and } n \text{ is odd}\}$$ we have that $$3 \in \text{Odds}$$ while $$2 \notin \text{Odds}.$$
(In fact, we have already seen `in` in action with Python's `for` loop, which has the form
for element in iterative_object:
indicating that the code should loop through every element `element` that is a member of `iterative_object`.)
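As a minimal illustration (a throwaway list of our own):
In [ ]:
# The loop visits each element of the list in turn.
for element in ['a', 'b', 'c']:
    print element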
You can use `in` with a Pandas `Series` to test whether an element is a member of the index of the `Series`. For example,
In [ ]:
'b' in obj2
In [ ]:
'e' in obj2
We have talked about the native Python list data type. There is another important native data type in Python, called a `dict`, which you can learn more about here. Python `dict` types are similar to association lists in Scheme, in that they require a lookup key in order to access elements.
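As a quick illustration (a toy `dict` of our own, not part of the skeleton notebook):
In [ ]:
# Keys act as lookup labels, much as an index labels Series entries.
capitals = {'Ohio': 'Columbus', 'Texas': 'Austin'}
capitals['Texas']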
Crucially, Pandas can create a `Series` from a `dict` by interpreting the key of each item as its corresponding index value, which is quite natural. In this sense, I find it useful to think of the relationship between NumPy and Pandas as akin to the relationship between a list and a `dict`.
In [ ]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
What happens when you build a `Series` from an existing dataset but supply a new index containing a label with no corresponding data?
In [ ]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
In this case, `California` is a previously-unused index label with no corresponding value. Thus, Pandas initializes the new `Series` with the value corresponding to `California` set to `NaN` (Pandas' marker for missing or null data).
The `isnull` method returns a `Series` of Boolean values indicating wherever the original `Series` has a null (`NaN`) value.
In [ ]:
pd.isnull(obj4)
The `notnull` method does the exact opposite!
In [ ]:
pd.notnull(obj4)
The methods `isnull` and `notnull` are "static" in the sense that they can be called either from the `pd` module or on a specific `Series` object.
In [ ]:
obj4.isnull()
Recall the two `Series`, `obj3` and `obj4`:
In [ ]:
print "\tobj3:\n",obj3, "\n\n\tobj4:\n", obj4
Arithmetic operations between distinct `Series` objects work conservatively. For data types, `int64 + float64 = float64`, preserving the decimal information. The index of the sum is the union of the two indices. Consider the following example:
In [ ]:
obj3 + obj4
No entry for `California` exists in `obj3`, while no entry for `Utah` exists in `obj4`. Pandas interprets `NaN + x = NaN` for all `x`, so the resultant `Series` sets `NaN` for both `California` and `Utah`.
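You can verify the propagation rule at the scalar level (a one-line check of ours):
In [ ]:
# NaN propagates through arithmetic: any operation involving NaN yields NaN.
np.nan + 1.0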
We can set some metadata for a `Series`, such as the name of the values column and the name of the index column.
In [ ]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4
You can also completely change the index at any time; this is something we will cover in more detail later.
In [ ]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
As we said before, you can think of a `DataFrame` as an array of `Series` objects. Specifically, a `DataFrame` is a two-dimensional array of `Series` objects, all sharing the same index. You can also think of a `DataFrame` as a single Microsoft Excel spreadsheet.
One way to initialize a `DataFrame` is by giving a `dict` in which each key indicates a Python list.
In [ ]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
You can reorder the columns in a new `DataFrame` using the `columns` argument:
In [ ]:
DataFrame(data, columns=['year', 'state', 'pop'])
Similarly, the optional `index` argument to `DataFrame` allows you to specify the index list. Additionally, adding a `debt` column with no corresponding data in `data` will initialize a column filled with `NaN` entries.
In [ ]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
frame2
You can access the columns of a `DataFrame` as follows:
In [ ]:
frame2.columns
You can slice out a particular column by specifying its column name. Notice how this returns a `Series`.
In [ ]:
frame2['state']
Alternatively, you can slice a column using the following syntax:
In [ ]:
frame2.year
To slice a row, you can specify its index label via `ix`, which will return a `Series` representing the row at that index.
In [ ]:
frame2.ix['three']
Broadcasting works in the natural way that you might expect:
In [ ]:
frame2['debt'] = 16.5
frame2
You can also assign a list or `ndarray` to a particular column, which will then be distributed across the column.
In [ ]:
frame2['debt'] = np.arange(5.)
frame2
Finally, you can assign a `Series` to a column of a `DataFrame`. If the `Series` has an index differing from that of the `DataFrame`, the unmatched entries of the column will be set to `NaN`.
In [ ]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
The point of Pandas is that there are numerous ways to achieve the same effect, depending on whatever is easiest for the task at hand. Here is another way to add a column:
In [ ]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
We can also use Python's `del` statement to remove a column:
In [ ]:
del frame2['eastern']
frame2.columns
One final way to initialize a `DataFrame` object is with nested `dict` objects.
In [ ]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3
You can transpose a `DataFrame` if it makes more sense to work with the rows and columns flipped.
In [ ]:
frame3.T
You can also control the resulting index from the outset by specifying it manually, overriding the keys of the inner dicts.
In [ ]:
DataFrame(pop, index=[2001, 2002, 2003])
`DataFrame` objects can also be initialized from `dict`s of `Series` objects.
In [ ]:
pdata = {'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)
In [ ]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3
If you need to access the underlying `ndarray` of any `DataFrame`, use the `DataFrame.values` field.
In [ ]:
frame3.values
In [ ]:
frame2.values
The `Index` is the "metadata" object for `Series` and `DataFrame` objects. We've seen ways of initializing `Index` objects before, so here we will go over some features of these objects.
In [ ]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
`Index` objects can be sliced like arrays.
In [ ]:
index[1:]
Importantly, `Index` objects are immutable, so you can't change their values in the natural way:
In [ ]:
index[1] = 'd'  # raises an error: Index objects are immutable
You can initialize `Index` objects from NumPy `ndarray` objects.
In [ ]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index
In [ ]:
frame3
In [ ]:
print 'Ohio' in frame3.columns, "|", 2003 in frame3.index
Now that we are familiar with the basic objects in Pandas, we will start working with the mechanics of these objects.
In the previous section we mentioned that `Index` objects are immutable. Here we will address this issue.
In [ ]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
The simplest way to change the `Index` of an existing `Series` or `DataFrame` is with the `reindex` method.
In [ ]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
In the above example, since `'e'` was not in the original `Index`, the corresponding `Series` value is set to `NaN`. If you want to change the default fill value, `reindex` can take an additional parameter, `fill_value`.
In [ ]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
A different approach uses the `method` parameter, which attempts to interpolate or fill the existing data into the new `Index`. One such method is `ffill`, which "step-fills" the existing data forward. Alternatively, `bfill` "step-fills" the data backwards.
In [ ]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
The `reindex` method works for `DataFrame` objects as well, where it can also specify column reindexing.
In [ ]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
frame
In [ ]:
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
columns=states)
Alternatively, you can use `ix` to achieve the same effect more concisely.
In [ ]:
frame.ix[['a', 'b', 'c', 'd'], states]
Suppose you have a `Series` object with data you wish to remove. Using the `drop` method, you can specify an index element to remove.
In [ ]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
You can also drop a list of index elements at once.
In [ ]:
obj.drop(['d', 'c'])
The `drop` method works the same way for `DataFrame` objects.
In [ ]:
data = DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
In [ ]:
data.drop(['Colorado', 'Ohio'])
Additionally, `DataFrame.drop()` can remove columns by specifying the `axis` parameter.
In [ ]:
data.drop('two', axis=1)
In [ ]:
data.drop(['two', 'four'], axis=1)
In this section we will explore the various techniques available for slicing `Series` and `DataFrame` objects. On the one hand, we can treat these objects as `dict` structures, accessing elements by their index keys. On the other hand, we can treat them as list structures, accessing elements by their position in the index list.
In [ ]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']
In [ ]:
obj[1]
This flexibility allows you to use all of the array slicing that worked for NumPy `ndarray` objects.
In [ ]:
obj[2:4]
Conversely, you can use a list of `dict`-style keys to achieve the same end.
In [ ]:
obj[['b', 'a', 'd']]
Here are some alternative slicing techniques for `Series` objects.
In [ ]:
obj[[1, 3]]
In [ ]:
obj[obj < 2]
In [ ]:
obj['b':'c']
You can assign values to sub-objects, and the changes are reflected in the original object.
In [ ]:
obj['b':'c'] = 5
obj
The same capabilities extend to `DataFrame` objects, with the added flexibility that the same indexing techniques apply to column slicing as well as index slicing.
In [ ]:
data = DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
In [ ]:
data['two']
In [ ]:
data[['three', 'one']]
The natural slicing will always refer to the index list, not the column list, which is useful to keep in mind.
In [ ]:
data[:2]
In [ ]:
data[data['three'] > 5]
Recall that you can generate a corresponding Boolean array by subjecting a `DataFrame` to a Boolean statement, such as the following:
In [ ]:
data < 5
You can use Boolean arrays to do simple thresholding of your data: isolate the entries satisfying a Boolean condition and manipulate that specific subset of the data.
In [ ]:
data[data < 5] = 0
data
The `DataFrame.ix` field gives you even more powerful ways to slice your data. In general, slicing with `ix` works by providing two arguments, an index and a column specification, and it will then return that particular subset.
In [ ]:
data.ix['Colorado', ['two', 'three']]
You can overload requests by using a list of index or column elements. Additionally, you may reorder the indices or columns in your subset by permuting the order of the specified elements, so long as they exist in the original `DataFrame`.
In [ ]:
data.ix[['Colorado', 'Utah'], [3, 0, 1]]
The `ix` approach is very powerful. See if you can work through the mechanics of the next few examples to see just how versatile slicing with `ix` actually is.
In [ ]:
data.ix[2]
In [ ]:
data.ix[:'Utah', 'two']
In [ ]:
data.ix[data.three > 5, :3]
As we mentioned before, we can do arithmetic on `Series` and `DataFrame` objects.
In [ ]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print s1, "\n\n"
print s2
Importantly, arithmetic is only performed on elements sharing an index. If either object has an index value that the other does not, the arithmetic operation is undefined there, and the resultant object contains a `NaN` element.
In [ ]:
s1 + s2
The same holds for `DataFrame` arithmetic, except that now both the index and the columns of each `DataFrame` must match up for an entry to be well-defined.
In [ ]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print df1, "\n\n"
print df2
In [ ]:
df1 + df2
Often `NaN` values are undesirable, as they can cause errors when doing arithmetic operations on the data.
In [ ]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
print df1, "\n\n"
print df2
In [ ]:
df1 + df2
This can be avoided by using the built-in `DataFrame.add()` method, which takes as parameters a `DataFrame` object and an optional `fill_value` that deals with otherwise-`NaN` entries.
In [ ]:
df1.add(df2, fill_value=0)
In fact, most `DataFrame` reorganization methods, such as `reindex`, take `fill_value` as a parameter to deal with undefined cases.
In [ ]:
df1.reindex(columns=df2.columns, fill_value=0)
Broadcasting NumPy arrays is a very useful technique for performing arithmetic operations concisely and efficiently. While normal Python arithmetic is interpreted, NumPy arithmetic is backed by compiled C code, which is much more efficient in general.
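A rough illustration of the speed difference (a sketch of ours, not from the skeleton notebook; exact timings vary by machine), using IPython's `%timeit` magic:
In [ ]:
# Summing a million elements: Python's built-in sum loops in the
# interpreter, while np.sum runs a compiled C loop over the buffer.
big = np.arange(1e6)
%timeit sum(big)
%timeit np.sum(big)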
In [ ]:
arr = np.arange(12.).reshape((3, 4))
arr
Normally we think of broadcasting as applying a scalar to a one-dimensional array. In fact, broadcasting is much more powerful: you can broadcast an array over a bigger array.
In [ ]:
arr[0]
In [ ]:
arr - arr[0]
`DataFrame` and `Series` objects work along similar lines.
In [ ]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
print frame, "\n\n"
print series
You can broadcast the values in a `Series` over its parent `DataFrame` as you would with NumPy `ndarray`s.
In [ ]:
frame - series
Of course, if either the `Series` or the `DataFrame` object has index or column values the other does not, the undefined entries are simply set to `NaN`. (We discussed ways to avoid this issue in the previous sections.)
In [ ]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2
In [ ]:
series3 = frame['d']
print frame, "\n\n"
print series3
Using the built-in `DataFrame` arithmetic operations such as `add` or `sub` gives the option to specify the axis (0: index, 1: columns) over which the arithmetic will take place (again, you can use `fill_value` to avoid potential `NaN` values).
In [ ]:
frame.sub(series3, axis=0)
One of the most important capabilities of `Series` and `DataFrame` objects is the ability to apply function transformations to the data. Every `ufunc` defined by NumPy can be applied to a `DataFrame` (or `Series`) object.
In [ ]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
For example, you can apply a nonnegativity transform using NumPy's built-in absolute value.
In [ ]:
np.abs(frame)
You can define and apply custom functions in two fashions. One is by using lambdas to construct anonymous functions:
In [ ]:
frame.apply(lambda x: x.max() - x.min())
In [ ]:
frame.apply(lambda x: x.max() - x.min(), axis=1)
Alternatively, you can define your own unary function and simply apply it using the same overall approach.
In [ ]:
def f(x):
return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
For presentations and general readability, it is useful to format decimal or date values into condensed forms, and Pandas lets you achieve this with the `applymap` method for `DataFrame` objects. The difference between `apply` and `applymap` is rather subtle and often functionally negligible, but the idea is that `apply` works on whole rows or columns, whereas `applymap` is element-wise.
In [ ]:
format = lambda x: '%.2f' % x
frame.applymap(format)
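As a sketch of the distinction (our own example): the element-wise formatting above can be reproduced with `apply` by mapping over each column.
In [ ]:
# apply receives whole columns (each col is a Series); applymap visits
# each element. Mapping format over every column reproduces applymap.
frame.apply(lambda col: col.map(format))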
Alternatively, for a single column you can use the `Series` method `map`.
In [ ]:
frame['e'].map(format)
One fundamental problem in data analysis, let alone computer science in general, is sorting data. Pandas provides a number of techniques for sorting by the index, by the columns, and by the actual data itself.
The first technique is `sort_index`, a method on both `Series` and `DataFrame` objects. For `Series` objects, `sort_index` works as follows:
In [ ]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
Since there is only one meaningful axis to sort, the index labels, `sort_index` is a very intuitive method for `Series` objects. I want to point out that `sort_index` has no side effects; that is, calling `sort_index` on an object does not actually change the internals of the object itself. Instead, a sorted copy of the original object is produced.
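A quick sanity check of the copy semantics (our own example):
In [ ]:
# sort_index returns a new, sorted Series; the original keeps its order.
sorted_obj = obj.sort_index()
obj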
The method `sort_index` works similarly with `DataFrame` objects, but now there are two potential axes along which to sort. The default is the index, as we see below:
In [ ]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame.sort_index()
By specifying the axis as a parameter, one can sort the columns instead. (Recall that in Python everything begins at 0, so the second axis corresponds to axis number 1.)
In [ ]:
frame.sort_index(axis=1)
The `sort_index` method also allows you to flip the ordering by specifying the `ascending` parameter.
In [ ]:
frame.sort_index(axis=1, ascending=False)
If you want to sort the elements themselves, as opposed to the index, Pandas provides the `order` method for `Series` objects.
In [ ]:
obj = Series([4, 7, -3, 2])
obj.order()
By default, `NaN` values are placed at the end when sorting the `Series`.
In [ ]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.order()
For `DataFrame` objects you can specify the index or column you wish to sort by. Additionally, if your data set is properly constructed, you can sort by two columns at once, as the example below exhibits:
In [ ]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame.sort_index(by='b')
In [ ]:
frame.sort_index(by=['a', 'b'])
It is possible for a `Series` or `DataFrame` object not to have a unique index. For example:
In [ ]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
Pandas gives the index of any object a field indicating whether or not the index is unique (no duplicate labels).
In [ ]:
obj.index.is_unique
If an index is not unique, then slicing the object for a repeated index returns a sub-object. For example:
In [ ]:
obj['a']
For unique index items, the default return-type is a scalar:
In [ ]:
obj['c']
The same goes for `DataFrame` objects, although they are more complicated.
In [ ]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Remember that you have to slice the index using `ix`, and you will observe the same behavior.
In [ ]:
df.ix['b']
Oftentimes, you need a quick way to come up with basic summary statistics of a data set. The solution that Pandas provides is incredibly robust, especially with regard to `NaN` entries.
In [ ]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df
By default, the `sum` method will skip `NaN` entries for each column in a `DataFrame`.
In [ ]:
df.sum()
For the `DataFrame` object, you can also sum along either the index axis or the column axis. Again, `sum` will skip over `NaN` elements when arriving at a value.
In [ ]:
df.sum(axis=1)
If you don't want this behavior, you can always tell the statistic you are applying not to skip the `NaN` entries. Here is an example using `mean`:
In [ ]:
df.mean(axis=1, skipna=False)
Another useful statistic is `idxmax`, which returns the index of the maximum value in each column of a `DataFrame`.
In [ ]:
df.idxmax()
One incredibly useful method is `cumsum`, which has a number of important applications in the analysis of probability distributions and random walks.
In [ ]:
df.cumsum()
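For instance, here is a small sketch of the random-walk connection (our own example, not from the skeleton notebook): cumulatively summing random ±1 steps traces out a walk.
In [ ]:
# Each step is +1 or -1 with equal probability; cumsum gives the
# walker's position after each step.
steps = Series(np.where(np.random.randn(1000) > 0, 1, -1))
walk = steps.cumsum()
walk.tail()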
You can also get a quick overview of all of the summary statistics of a `DataFrame` simply by calling the `describe` method.
In [ ]:
df.describe()
| Method | Description |
|---|---|
| `count` | Number of non-`NaN` values |
| `describe` | Compute set of summary statistics for a `Series` or each `DataFrame` column |
| `min`, `max` | Compute minimum and maximum values |
| `argmin`, `argmax` | Compute index locations for minimum and maximum values |
| `idxmin`, `idxmax` | Compute index values for minimum and maximum values |
| `quantile` | Compute sample quantile ranging from 0 to 1 |
| `sum` | Sum of values |
| `mean` | Mean of values |
| `median` | Arithmetic median of values |
| `mad` | Mean absolute deviation from mean value |
| `var` | Sample variance of values |
| `std` | Sample standard deviation of values |
| `skew` | Sample skewness (3rd moment) of values |
| `kurt` | Sample kurtosis (4th moment) of values |
| `cumsum` | Cumulative sum of values |
| `cummin`, `cummax` | Cumulative min and max of values |
| `cumprod` | Cumulative product of values |
| `diff` | Compute 1st arithmetic difference (useful for time series) |
| `pct_change` | Compute percent changes |
`Series` objects also have a `describe` method, which outputs statistics based on the `dtype` of the underlying object. In the above example, `df` had a `dtype` of `float64`, so `describe` produced information pertinent to floating-point numerics. In the below example, the `Series` object has a `dtype` of `object`, which results in different summary statistics.
In [ ]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
One common problem in data analysis, especially the analysis of time-series data like historical prices of financial securities, is correlation and covariance analysis. To this end Pandas has a number of features to make such analysis simple.
Here is one example, using the built-in data aggregator for Yahoo! Finance in the Pandas API. Returns on a stock are defined as the percent change in the stock's closing value from day to day.
In [ ]:
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'CSCO']:
all_data[ticker] = web.get_data_yahoo(ticker)
price = DataFrame({tic: data['Adj Close']
for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
for tic, data in all_data.iteritems()})
returns = price.pct_change()
returns.tail()
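If the remote Yahoo! Finance fetch is unavailable, a minimal sketch on made-up prices (our own numbers) shows what `pct_change` computes:
In [ ]:
# pct_change returns (p_t - p_{t-1}) / p_{t-1}; the first entry is NaN.
prices = Series([100.0, 102.0, 99.96])
prices.pct_change()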
When given another `Series` object, the `corr` method computes the scalar correlation between the two `Series`.
In [ ]:
returns.MSFT.corr(returns.IBM)
Similarly, `cov` computes the scalar covariance between two `Series`. By contrast, calling `corr` or `cov` on a whole `DataFrame` returns the full correlation or covariance matrix, respectively.
In [ ]:
returns.MSFT.cov(returns.IBM)
In [ ]:
returns.corr()
In [ ]:
returns.cov()
The `corrwith` method computes pairwise correlations and stores the result in a `Series`. Note that the correlation between IBM and IBM is 1.
In [ ]:
returns.corrwith(returns.IBM)
Passing a `DataFrame` instead computes correlations between like-named columns.
In [ ]:
returns.corrwith(volume)
Given data with repeats, you can eliminate the duplicates by using the `unique` method.
In [ ]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
The `value_counts` method returns a `Series` whose index is made up of the unique entries in the original `Series` and whose entries give the number of appearances of each value.
In [ ]:
obj.value_counts()
You can perform set-membership operations to, for example, construct masks which you can then apply to your original data.
In [ ]:
mask = obj.isin(['b', 'c']) # This forms a Series object of Boolean values
obj[mask]
For `DataFrame` objects, you can apply the `value_counts` method to each sub-`Series`, producing a new `DataFrame` of frequency statistics.
In [ ]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
'Qu2': [2, 3, 1, 2, 3],
'Qu3': [1, 5, 2, 4, 4]})
data.apply(pd.value_counts).fillna(0)
One of the primary problems with data analysis is the prevalence of missing data. In many cases, arithmetic operations, summary statistics, and other functions require that your data be intact in order to provide meaningful results. Pandas gives a number of functions to address the problem of missing data, allowing you to filter it out easily.
Consider this `Series` of string values.
In [ ]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
The `isnull` method identifies every `NaN` entry. Alternatively, `notnull` will identify every non-`NaN` entry.
In [ ]:
string_data.isnull()
In [ ]:
string_data[0] = None
string_data.notnull()
A simple way to remove missing entries from a `Series` object is to use `dropna`.
In [ ]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
Alternatively, you can use a Boolean `Series` from `notnull` to mask the original data.
In [ ]:
data[data.notnull()]
`DataFrame` objects are trickier. For example, how should Pandas handle a mostly-complete row? The correct answer is ambiguous. By default, `dropna` will eliminate any row containing a `NaN` value (we aliased `NaN` as `NA` above).
In [ ]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
In [ ]:
cleaned
Alternatively, you can require that a row be eliminated only if it is completely empty.
In [ ]:
data.dropna(how='all')
You can also specify columns for deletion. Again, you can change the deletion requirements as needed.
In [ ]:
data[4] = NA # fill a column entirely with NA
data.dropna(axis=1, how='all')
The `dropna` method is very robust. You can also specify a minimum threshold of non-`NaN` entries that a row must have in order to survive. In the next example, we threshold at 2 entries per row, allowing rows with one `NaN` value to stay while deleting any patchier rows.
In [ ]:
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA
df
In [ ]:
df.dropna(thresh=2)
Instead of eliminating missing data outright, Pandas lets you fill in the missing values. The simple approach, using `fillna`, is to pass a value that will then replace every `NaN` entry.
In [ ]:
df.fillna(0)
Alternatively, you can specify different fill values for different columns by giving a `dict` keyed by column name.
In [ ]:
df.fillna({1: 0.5, 2: -1})
Using the `inplace` argument, you can overwrite the original `DataFrame` object.
In [ ]:
# always returns a reference to the filled object
_ = df.fillna(0, inplace=True)
df
The other main filling technique is to fill by procedure. `ffill` will copy the previous value in a column into the `NaN` entry.
In [ ]:
df = DataFrame(np.random.randn(6, 3))
df.ix[2::2, 1] = NA; df.ix[4:, 2] = NA
print df, "\n\n"
print df.fillna(method='ffill')
In cases where you don't want this to extend indefinitely, you can limit the fill to a certain number of `NaN` entries after the last available value.
In [ ]:
df.fillna(method='ffill', limit=1)
From "Python for Data Analysis":
Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form. Let's start with a simple example; create a
Series
with a list of lists or arrays as the index:
In [ ]:
data = Series(np.random.randn(10),
index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
Notice how the index `a` corresponds to the sub-indices `1`, `2`, and `3` and their corresponding data. The `index` object is thus not a simple list of labels but a hierarchical index tracking the inner sub-indices as well.
In [ ]:
data.index
Accessing an outer label will give you the sub-`Series` that it corresponds to.
In [ ]:
print data['b'], "\n\n"
print data['b':'c']
You can also slice on an inner sub-index, which returns a `Series` of all the outer indices carrying that sub-index and their corresponding values.
In [ ]:
data[:, 2]
You can use `unstack` to pivot the inner index level into the columns of a `DataFrame` object.
In [ ]:
data.unstack()
The inverse of `unstack` is `stack`. Observe:
In [ ]:
data.unstack().stack()
Multi-indexing follows a similar logic with `DataFrame` objects, but it becomes more complicated, as both the index and the columns can be given a hierarchy:
In [ ]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame
For clarity, let's rename the labels so we know what level we are looking at.
In [ ]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Now, by specifying any column, whether on the top level or any sublevel, you can get the `DataFrame` of values corresponding to that name.
In [ ]:
frame['Ohio']
`MultiIndex` objects are independent objects in Pandas, meaning that you can create them without a corresponding `DataFrame` and reuse them as needed.
In [ ]:
from pandas import MultiIndex
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
names=['state', 'color'])
You can always swap indices on the same level. For example, if you want `key2` and `key1` to switch, you can write
In [ ]:
frame.swaplevel('key1', 'key2')
Additionally, you can sort by a particular level (in general, you can't sort them all at once). Specify the level by its position (first is 0, second is 1), and you will see the sort take place:
In [ ]:
frame.sortlevel(1)
In typical object-oriented fashion, you can chain these actions into one statement. For example:
In [ ]:
frame.swaplevel(0, 1).sortlevel(0)
With hierarchical indexing, you can specify the level and axis over which to compute summary statistics. If you want to compute the sum of all values within each `key2` label, you get the relevant sub-`DataFrame`:
In [ ]:
frame.sum(level='key2')
This of course extends to the columns as well, as you have grown accustomed to with `DataFrame` methods.
In [ ]:
frame.sum(level='color', axis=1)
In the examples above we showed how to `stack` and `unstack` `Series` objects into `DataFrame`s. But in general, `DataFrame` objects give you a lot of discretion regarding which columns you want to convert into indices.
In [ ]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
frame
You can overload `set_index` with more than one column to produce a hierarchical index using the values of each respective column.
In [ ]:
frame2 = frame.set_index(['c', 'd'])
frame2
Crucially, the default Pandas behavior is to remove the indexed columns from the data. You can force Pandas to keep the old columns by specifying the `drop` parameter:
In [ ]:
frame.set_index(['c', 'd'], drop=False)
Finally, `reset_index` does the opposite of `set_index`, moving the hierarchical index levels back into ordinary columns:
In [ ]:
frame2.reset_index()